Transitioning to Production: The Deployment Mindset
EvoClass-AI002 Lecture 10

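This module is the final link in the course: it bridges the gap between successful research (high accuracy in a notebook) and reliable execution. Deployment turns a PyTorch model into a minimal, self-contained service that delivers predictions to end users efficiently, with low latency and high availability.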

1. The Shift to a Production Mindset

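The exploratory environment of a Jupyter notebook is stateful and, for production purposes, fragile. We must refactor the original exploratory scripts into structured, modular components that can handle concurrent requests, use resources efficiently, and integrate cleanly into larger systems.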

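Low-latency inference: Consistently keep prediction time below a target threshold (for example, $50\text{ms}$), which is critical for real-time applications.
High availability: Design the service to be stable and stateless, and able to recover quickly after a failure.
Reproducibility: Ensure that the deployed model and environment (dependencies, weights, configuration) exactly match the validated research results.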
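The latency target above can be checked empirically. The following is a minimal sketch, assuming a 95th-percentile criterion (the lecture only names a threshold, not a percentile) and using a hypothetical `fake_predict` function as a stand-in for a real forward pass. The helper names `measure_latency_ms` and `within_budget` are illustrative, not part of any library.

```python
import statistics
import time


def measure_latency_ms(predict, payload, runs=200, warmup=10):
    """Return per-call latencies in milliseconds for predict(payload)."""
    for _ in range(warmup):
        predict(payload)  # warm-up calls are excluded from the timing
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def within_budget(latencies_ms, budget_ms=50.0, percentile=95):
    """Return (ok, value) where value is the requested percentile latency."""
    # With n=100, quantiles() returns 99 cut points; index p-1 is the p-th percentile.
    cut_points = statistics.quantiles(latencies_ms, n=100)
    value = cut_points[percentile - 1]
    return value < budget_ms, value


def fake_predict(x):
    # Hypothetical stand-in for a model's forward pass (assumption, not from the lecture).
    return sum(v * v for v in x)


if __name__ == "__main__":
    lat = measure_latency_ms(fake_predict, [0.1] * 1000)
    ok, p95 = within_budget(lat)
    print(f"p95 = {p95:.3f} ms, within 50 ms budget: {ok}")
```

Checking a high percentile rather than the mean is one reasonable reading of "consistently" below the threshold, since a few slow requests can hide behind a good average.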
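Key point: The model service
We should not deploy the full training script. Instead, we deploy a minimal, self-contained service wrapper. This service handles only three tasks: loading the optimized model file, preprocessing the input, and running a forward pass to return predictions.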
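The three tasks can be sketched as a small class. This is an illustrative skeleton rather than the course's own `inference_service.py`: the class name `InferenceService`, the min-max scaling in `preprocess`, and the `load_dummy_model` stand-in are all assumptions. To keep the sketch dependency-free, the model is a plain Python callable; in a PyTorch service the loader might instead call `torch.jit.load(path)` and put the module in eval mode.

```python
from typing import Callable, List, Sequence

Model = Callable[[List[float]], List[float]]


class InferenceService:
    """Minimal serving wrapper: load once, preprocess, forward pass."""

    def __init__(self, load_model: Callable[[], Model]):
        # Task 1: load the model a single time at start-up, not per request.
        self._model = load_model()

    @staticmethod
    def preprocess(raw: Sequence[float]) -> List[float]:
        # Task 2: hypothetical preprocessing, scaling inputs to the [0, 1] range.
        if not raw:
            raise ValueError("empty input")
        lo, hi = min(raw), max(raw)
        span = (hi - lo) or 1.0  # avoid division by zero for constant input
        return [(v - lo) / span for v in raw]

    def predict(self, raw: Sequence[float]) -> List[float]:
        # Task 3: run the forward pass on preprocessed features.
        return self._model(self.preprocess(raw))


def load_dummy_model() -> Model:
    # Stand-in for reading an optimized model file (assumption for illustration).
    weights = [0.5, -0.25, 1.0]

    def forward(x: List[float]) -> List[float]:
        return [sum(w * v for w, v in zip(weights, x))]

    return forward


service = InferenceService(load_dummy_model)
print(service.predict([0.0, 5.0, 10.0]))
```

Because the instance keeps no per-request state beyond the read-only model, several requests can share it, which fits the stateless design goal listed earlier.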
Question 1
Which feature of a Jupyter notebook makes it unsuitable for production deployment?
It primarily uses Python code
It is inherently stateful and resource-intensive
It cannot directly access the GPU
Question 2
What is the primary purpose of converting a PyTorch model to TorchScript or ONNX before deployment?
Optimization for faster C++ execution and reduced Python dependency
To prevent model theft or reverse engineering
To automatically handle input data preprocessing
Question 3
When designing a production API, when should the model weights be loaded?
Once, when the service initializes
At the start of every prediction request
When the first request to the service is received
Challenge: Defining the Minimal Service
Plan the structural requirements for a low-latency service.
You need to deploy a complex image classification model ($1\text{GB}$) that requires specialized image preprocessing. It must handle $50$ requests per second.
Step 1
To ensure high throughput and low average latency, what is the single most critical structural change needed for the Python script?
Solution:
Refactor the codebase into isolated modules (Preprocessing, Model Definition, Inference Runner) and ensure the entire process is packaged for containerization.
Step 2
What is the minimum necessary "artifact" to ship, besides the trained weights?
Solution:
The exact code/class definition used for preprocessing and the model architecture definition, serialized and coupled with the weights.
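One way to keep the weights coupled to their preprocessing settings is to ship a small manifest alongside them. The sketch below is an assumption about how this could be done, not a format prescribed by the lecture: the manifest fields, the function names `write_manifest` and `verify_manifest`, and the example preprocessing values are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(weights_path: Path, preprocessing: dict, out_path: Path) -> dict:
    """Record the weights checksum together with the preprocessing settings."""
    manifest = {
        "weights_file": weights_path.name,
        "weights_sha256": sha256_of(weights_path),
        "preprocessing": preprocessing,
    }
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


def verify_manifest(weights_path: Path, manifest_path: Path) -> bool:
    """Return True if the weights on disk match the recorded checksum."""
    manifest = json.loads(manifest_path.read_text())
    return manifest["weights_sha256"] == sha256_of(weights_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        weights = Path(tmp) / "model.bin"          # hypothetical file name
        weights.write_bytes(b"placeholder weights")  # stand-in for real weights
        manifest_file = Path(tmp) / "manifest.json"
        write_manifest(weights, {"resize": [224, 224]}, manifest_file)
        print(verify_manifest(weights, manifest_file))
```

A service could run the verification step at start-up and refuse to serve if the checksum does not match, which guards against shipping weights that differ from the validated ones.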